Nutch + Solr - Indexer causes java.lang.OutOfMemoryError: Java heap space

2014-09-07 Thread glumet
Hello everyone, 

I have configured my 2 servers to run in distributed mode (with Hadoop). My
crawling setup is Nutch 2.2.1 with HBase (as storage) and Solr; Solr runs
under Tomcat. The problem appears every time I try the last step, indexing
data from HBase into Solr: the error in *[1]* occurs. I tried adding
CATALINA_OPTS (or JAVA_OPTS) like this:

CATALINA_OPTS="$JAVA_OPTS -XX:+UseConcMarkSweepGC -Xms1g -Xmx6000m
-XX:MinHeapFreeRatio=10 -XX:MaxHeapFreeRatio=30 -XX:MaxPermSize=512m
-XX:+CMSClassUnloadingEnabled"

to Tomcat's catalina.sh script and started the server with it, but it didn't
help. I also added the properties in *[2]* to my nutch-site.xml file, but the
job ended with an OutOfMemoryError again. Can you help me, please?

*[1]*
2014-09-06 22:52:50,683 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2367)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:587)
    at java.lang.StringBuffer.append(StringBuffer.java:332)
    at java.io.StringWriter.write(StringWriter.java:77)
    at org.apache.solr.common.util.XML.escape(XML.java:204)
    at org.apache.solr.common.util.XML.escapeCharData(XML.java:77)
    at org.apache.solr.common.util.XML.writeXML(XML.java:147)
    at org.apache.solr.client.solrj.util.ClientUtils.writeVal(ClientUtils.java:161)
    at org.apache.solr.client.solrj.util.ClientUtils.writeXML(ClientUtils.java:129)
    at org.apache.solr.client.solrj.request.UpdateRequest.writeXML(UpdateRequest.java:355)
    at org.apache.solr.client.solrj.request.UpdateRequest.getXML(UpdateRequest.java:271)
    at org.apache.solr.client.solrj.request.RequestWriter.getContentStream(RequestWriter.java:66)
    at org.apache.solr.client.solrj.request.RequestWriter$LazyContentStream.getDelegate(RequestWriter.java:94)
    at org.apache.solr.client.solrj.request.RequestWriter$LazyContentStream.getName(RequestWriter.java:104)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:247)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
    at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
    at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:96)
    at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:117)
    at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:54)
    at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.close(MapTask.java:650)
    at org.apache.hadoop.mapred.MapTask.closeQuietly(MapTask.java:1793)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:779)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
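
The trace in [1] is logged by org.apache.hadoop.mapred.Child, and its outermost frames are Hadoop map-task classes. That suggests the error is raised in the Hadoop task JVM that runs the Nutch indexer, not in Tomcat, in which case heap options passed through CATALINA_OPTS would not reach the failing process. Below is a minimal Python sketch of that check. The function names and the two package prefixes are illustrative assumptions, not part of Nutch, Hadoop, or Solr.

```python
import re
from typing import List

# Matches one Java stack frame such as
# "at org.apache.hadoop.mapred.Child$4.run(Child.java:255)"
# and captures the fully qualified class name. \s+ also tolerates
# traces where "at" and the frame were wrapped onto separate lines.
FRAME = re.compile(r"\bat\s+([\w.$]+)\.[\w$<>]+\(")


def frame_classes(trace: str) -> List[str]:
    """Return the class of every frame, innermost frame first."""
    return FRAME.findall(trace)


def likely_jvm(trace: str) -> str:
    """Guess which process produced the trace from its frames.

    The prefixes below are assumptions chosen for this example only.
    """
    classes = frame_classes(trace)
    if any(c.startswith("org.apache.hadoop.mapred.") for c in classes):
        return "hadoop-task"
    if any(c.startswith("org.apache.catalina.") for c in classes):
        return "tomcat"
    return "unknown"
```

For the trace above this returns "hadoop-task". In Hadoop 1.x the heap of task JVMs is normally controlled by the mapred.child.java.opts property, so that setting may be worth checking before tuning Tomcat further.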

*[2]*


<property>
  <name>http.content.limit</name>
  <value>15000</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  For our purposes it is twice bigger than default - parsing big pages: 128 * 1024
  </description>
</property>

<property>
  <name>indexer.max.tokens</name>
  <value>10</value>
</property>

<property>
  <name>http.timeout</name>
  <value>5</value>
  <description>The default network timeout, in milliseconds.</description>
</property>

<property>
  <name>solr.commit.size</name>
  <value>100</value>
  <description>
  Defines the number of documents to send to Solr in a single update batch.
  Decrease when handling very large documents to prevent Nutch from running
  out of memory. NOTE: It does not explicitly trigger a server side commit.
  </description>
</property>
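
The solr.commit.size description above says that documents are sent to Solr in batches and that smaller batches lower the indexer's memory use. The Python sketch below illustrates that buffering pattern in general terms. It is not Nutch's SolrIndexWriter; the class name BatchingWriter and the send callback are hypothetical stand-ins for whatever call ships a batch to Solr.

```python
from typing import Callable, Dict, List


class BatchingWriter:
    """Buffers documents and sends them in batches of at most commit_size.

    Illustrative only: the real Nutch writer is Java code, and `send` is a
    hypothetical callback standing in for the Solr update request.
    """

    def __init__(self, commit_size: int, send: Callable[[List[Dict]], None]):
        if commit_size < 1:
            raise ValueError("commit_size must be at least 1")
        self.commit_size = commit_size
        self.send = send
        self.buffer: List[Dict] = []

    def write(self, doc: Dict) -> None:
        # Documents accumulate in memory until the batch is full.
        self.buffer.append(doc)
        if len(self.buffer) >= self.commit_size:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.send(self.buffer)
            self.buffer = []

    def close(self) -> None:
        # Whatever is still buffered goes out as one final batch.
        self.flush()
```

With this pattern the memory held at any moment is roughly the batch size times the size of a document, plus the serialized request body built from the batch, so large documents and a large batch size multiply each other.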
  




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Nutch-Solr-Indexer-causes-java-lang-OutOfMemoryError-Java-heap-space-tp4157308.html
Sent from the Solr - User mailing list archive at Nabble.com.


Apache Solr 4 - after 1st commit the index does not grow

2013-07-14 Thread glumet
I have written my own plugin for Apache Nutch 2.2.1 to crawl images, videos
and podcasts from selected sites (I have 180 URLs in my seed list). I store
this metadata in HBase and now I want to save it to the index (Solr). I
have a lot of metadata to save (web pages + images + videos + podcasts).

I am using the Nutch script bin/crawl for the whole process (inject, generate,
fetch, parse... and finally solrindex and dedup), but I have one problem.
When I run this script for the first time, approximately 6000 documents are
stored in the index (let's say 3700 docs for images, 1700 for web pages, and
the rest for videos and podcasts). That is OK...

but...

When I run the script a second time, a third time and so on, the number of
documents in the index does not increase (there are still 6000 documents),
but the number of rows stored in the HBase table grows (there are 97383 rows now)...

Do you know where the problem is, please? I have been fighting with this
problem for a really long time and I don't know... In case it helps, this is
my solrconfig.xml http://pastebin.com/uxMW2nuq and this is my
nutch-site.xml http://pastebin.com/4bj1wdmT
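
One way to confirm whether a crawl cycle actually added documents is to record Solr's document count before and after each run. The sketch below assumes Solr's standard select handler with a match-all query and wt=json, whose response carries the count in response.numFound. The helper names are hypothetical, and the base URL passed to count_url is a placeholder for your own Solr address.

```python
import json
from urllib.parse import urlencode


def count_url(solr_base: str) -> str:
    """Build a match-all query URL that returns no rows, only the hit count."""
    params = urlencode({"q": "*:*", "rows": 0, "wt": "json"})
    return solr_base.rstrip("/") + "/select?" + params


def parse_num_found(body: str) -> int:
    """Extract response.numFound from a Solr JSON select response."""
    return int(json.loads(body)["response"]["numFound"])


def index_grew(before: int, after: int) -> bool:
    """True if the index holds more documents after the run than before."""
    return after > before
```

The response body can be fetched with urllib.request.urlopen(count_url(base)).read().decode("utf-8") and passed to parse_num_found. If the count stays flat while HBase keeps growing, the Solr and indexer logs for that run are the next place to look.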



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Apache-Solr-4-after-1st-commit-the-index-does-not-grow-tp4077913.html


Re: Apache Solr 4 - after 1st commit the index does not grow

2013-07-14 Thread glumet
When I look into the log, there is:

SEVERE: auto commit error...:java.lang.IllegalStateException: this writer hit an OutOfMemoryError; cannot commit
    at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2668)
    at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2834)
    at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2814)
    at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:529)
    at org.apache.solr.update.CommitTracker.run(CommitTracker.java:216)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Apache-Solr-4-after-1st-commit-the-index-does-not-grow-tp4077913p4077924.html


Re: Apache Solr 4 - after 1st commit the index does not grow

2013-07-15 Thread glumet
OK, I have resolved the OutOfMemory problem by increasing the JVM
parameters... and now I have another problem. My index has been working since
yesterday evening and the number of documents increased (I run the bin/crawl
script every 3 hours and I have 27040 documents now), but the last increase
was 6 hours ago... Why did it stop growing again?

You can look at my solr here:
http://ir-dev.lmcloud.vse.cz:8082/solr/#/~logging

The log says:

java.lang.RuntimeException: [was class java.io.CharConversionException]
Invalid UTF-8 character 0x at char #2800441, byte #3096524)

What is it? How can I solve it? Does anyone have any idea?
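
If the exception comes from characters in the crawled content that are not legal in an XML update request (an assumption; the log line alone does not prove it), one common workaround is to filter such characters out of field values before they are sent to Solr. The Python sketch below keeps only characters allowed by the XML 1.0 "Char" production. The function name is hypothetical, and a real fix would have to live in the indexing plugin itself.

```python
import re

# Matches every character outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text: str) -> str:
    """Remove characters that an XML 1.0 document may not contain."""
    return _INVALID_XML_CHARS.sub("", text)
```

Ordinary text, including accented letters and characters outside the Basic Multilingual Plane, passes through unchanged; control characters such as NUL and noncharacters such as U+FFFF are dropped.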



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Apache-Solr-4-after-1st-commit-the-index-does-not-grow-tp4077913p4078077.html


Re: Apache Solr 4 - after 1st commit the index does not grow

2013-07-15 Thread glumet
As far as I can see, this is the same problem as the one in an older post,
http://lucene.472066.n3.nabble.com/strange-utf-8-problem-td3094473.html
...but it never got a response.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Apache-Solr-4-after-1st-commit-the-index-does-not-grow-tp4077913p4078079.html