Hello everyone,
I have configured my two servers to run in distributed mode (with Hadoop), and
my crawling setup is Nutch 2.2.1 with HBase as the storage backend and Solr
running under Tomcat. The problem appears every time I try to do the last
step - I mean when I want to index the data from HBase into Solr.
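In case it matters, the storage is wired up the standard Gora/HBase way for
Nutch 2.x - roughly like this (just a sketch of my setup, not copied verbatim):

  # conf/gora.properties
  gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

  <!-- conf/nutch-site.xml -->
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
  </property>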
I have written my own plugin for Apache Nutch 2.2.1 to crawl images, videos
and podcasts from selected sites (I have 180 URLs in my seed list). I store
this metadata in HBase and now I want to push it to the Solr index. There is
a lot of metadata to save (webpages + images + videos + podcasts).
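The last step I am talking about is the standard Nutch 2.x Solr indexing job,
roughly like this (the Solr URL here is just a placeholder for my real one):

  bin/nutch solrindex http://localhost:8983/solr/ -all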
When I look into the log, I see:
SEVERE: auto commit error...:java.lang.IllegalStateException: this writer hit an OutOfMemoryError; cannot commit
        at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2668)
        at org.apache.lucene.index.IndexWriter.commitInte
OK, I got rid of the OutOfMemory problem by increasing the JVM heap
parameters (see the snippet below)... and now I have another problem. The
index had been growing since yesterday evening (I run the bin/crawl script
every 3 hours and I have 27040 documents now), but the last increase was
6 hours ago.
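In case it helps, this is roughly how I raised the heap for the Tomcat
instance that runs Solr (the values are just an example of what I used, not a
recommendation):

  # $CATALINA_HOME/bin/setenv.sh
  export CATALINA_OPTS="$CATALINA_OPTS -Xms1g -Xmx4g"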
As far as I can see, this is the same problem as in an older post -
http://lucene.472066.n3.nabble.com/strange-utf-8-problem-td3094473.html
...but that one never got a response.