Re: general debugging techniques?

2010-07-02 Thread Jim Blomo
Just to confirm I'm not doing something insane, this is my general setup:

- index approx 1MM documents including HTML, pictures, office files, etc.
- files are not local to solr process
- use upload/extract to extract text from them through tika
- use commit=1 on each POST (reasons below)
- use optimize=1 every 150 documents or so (reasons below)
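
For concreteness, each POST looks roughly like this (hostname, id
literal, and file name are placeholders; commit=true is the documented
boolean form of the flag):

  # one upload through the ExtractingRequestHandler
  curl "http://localhost:8080/solr/update/extract?literal.id=doc1&commit=true" \
       -F "myfile=@/data/docs/report.pdf"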

Through many manual restarts and modifications to the upload script,
I've gotten about halfway (numDocs: 467372, disk usage 1.6G).  The
biggest problem is that any serious error cannot be recovered from
without restarting Tomcat, and serious errors can't be distinguished
at the client level from non-serious ones (e.g. Tika exceptions
thrown by bad documents).

On Wed, Jun 9, 2010 at 10:13 AM, Jim Blomo  wrote:
> In any case I bumped up the heap to 3G as suggested, which has helped
> stability.  I have found that in practice I need to commit every
> extraction because a crash or error will wipe out all extractions
> after the last commit.

I've also found that I need to optimize very regularly because I kept
getting "too many file handles" errors (though they usually surfaced
as the more cryptic "directory, but cannot be listed: list() returned
null" error).

What I am running into now is

SEVERE: Exception invoking periodic operation:
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.String.substring(String.java:1940)
[full backtrace below]

After a restart and optimize this goes away for a while (~100
documents) but then comes back and every request after the error
fails.  Even if I can't prevent this error, is there a way I can
recover from it more gracefully?  Perhaps an option to Solr or Tomcat
to restart itself when it hits that error?
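
The closest thing I've found so far is HotSpot's hook for running a
command when an OutOfMemoryError is thrown, which a process supervisor
could pair with an automatic restart (untested on my setup; %p expands
to the JVM's pid):

  # added to CATALINA_OPTS (Sun HotSpot JVM)
  -XX:OnOutOfMemoryError="kill -9 %p"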

Jim

SEVERE: Exception invoking periodic operation:
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.String.substring(String.java:1940)
at java.lang.String.substring(String.java:1905)
at java.io.File.getName(File.java:401)
at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:229)
at java.io.File.isDirectory(File.java:754)
at org.apache.catalina.startup.HostConfig.checkResources(HostConfig.java:1000)
at org.apache.catalina.startup.HostConfig.check(HostConfig.java:1214)
at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:293)
at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:120)
at org.apache.catalina.core.ContainerBase.backgroundProcess(ContainerBase.java:1306)
at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.processChildren(ContainerBase.java:1570)
at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.processChildren(ContainerBase.java:1579)
at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.run(ContainerBase.java:1559)
at java.lang.Thread.run(Thread.java:619)
Jul 3, 2010 1:32:20 AM org.apache.solr.update.processor.LogUpdateProcessor finish


Re: general debugging techniques?

2010-07-06 Thread Jim Blomo
On Sat, Jul 3, 2010 at 1:10 PM, Lance Norskog  wrote:
> You don't need to optimize, only commit.

OK, thanks for the tip, Lance.  I thought the "too many open files"
problem was because I wasn't optimizing/merging frequently enough.  My
understanding of your suggestion is that commit also does merging, and
since I am only building the index, not querying or updating it, I
don't need to optimize.

> This means that the JVM spends 98% of its time doing garbage
> collection. This means there is not enough memory.

I'll increase the heap to 4G, decrease the documentCache size to 5, and try again.
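
For reference, the two changes together (the sizes are my guesses; the
documentCache element follows the stock solrconfig.xml layout):

  # Tomcat startup environment (e.g. setenv.sh): bigger heap
  export CATALINA_OPTS="-Xmx4g"

  <!-- solrconfig.xml: shrink the document cache so a handful of large
       extracted documents can't pin the heap -->
  <documentCache class="solr.LRUCache"
                 size="5"
                 initialSize="5"
                 autowarmCount="0"/>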

> I made a mistake - the bug in Lucene is not about PDFs - it happens
> with every field in every document you index in any way - so doing this
> in Tika outside Solr does not help. The only trick I can think of is
> to alternate between indexing large and small documents. This way the
> bug does not need memory for two giant documents in a row.

I've checked out and built Solr from branch_3x with the
tika-0.8-SNAPSHOT patch.  (Earlier I was having trouble with Tika
crashing too frequently.)  I've confirmed that LUCENE-2387 is fixed in
this branch, so hopefully I won't run into that this time.
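
For anyone else trying this, these are roughly the steps I used (URL
and build targets from memory, so double-check them):

  # check out and build Solr from branch_3x
  svn co http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x
  cd branch_3x/solr && ant dist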

> Also, do not query the indexer at all. If you must, don't do sorted or
> faceting requests. These eat up a lot of memory that is only freed
> with the next commit (index reload).

Good to know, though I have not been querying the index and definitely
haven't ventured into faceted requests yet.

The advice is much appreciated,

Jim


Re: Which Solr to use?

2010-05-21 Thread Jim Blomo
On Tue, May 18, 2010 at 12:31 PM, Sixten Otto  wrote:
> So features are being actively added to / code rearranged in
> trunk/4.0, with some of the work being back-ported to this branch to
> form a stable 3.1 release? Is that accurate?
>
> Is there any thinking about when that might drop (beyond the quite
> understandable "when it's done")? Or, perhaps more reasonably, when it
> might freeze?

I'm also interested in the recommended "testing" branch (to borrow a
Debian term) to use.  I'm planning a deployment in 2 months or so and
have been running into too many problems with the older Tika bundled
in Solr 1.4 to use that release.

Jim


general debugging techniques?

2010-06-03 Thread Jim Blomo
I am new to debugging Java services, so I'm wondering what the best
practices are for debugging Solr on Tomcat.  I'm running into a few
issues while building up my index, using the ExtractingRequestHandler
to extract text from my sources.  I can read through the catalina
log, but it seems to log only requests; it gives little information
about errors or about when the service hangs.  Here are some examples:

Some zip and Office files uploaded to the extract requestHandler
simply hang, with the jsvc process spinning at 100% CPU.  I'm unclear
where in the process the request is hanging.  Did it make it through
Tika?  Is it attempting to index?  The problem is often not
reproducible after restarting Tomcat and resuming with the last failed
document.
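
One thing I plan to try the next time it hangs: sending the JVM a
SIGQUIT makes it dump all thread stacks to catalina.out, which should
show whether the stuck thread is inside Tika or inside the Lucene
indexing code (<pid> here is the jsvc child process):

  # dump thread stacks of the running JVM; output goes to catalina.out
  kill -3 <pid>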

Although I am keeping document size under 5MB, I regularly see
"SEVERE: java.lang.OutOfMemoryError: Java heap space" errors.  How can
I find out which component had this problem?
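
In the meantime, I can at least capture a heap dump on the next crash
and inspect it with jhat or a similar tool (HotSpot flags; the dump
path is a placeholder):

  # added to CATALINA_OPTS
  -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/tomcat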

After the above error, I often see this followup error on the next
document: "SEVERE: org.apache.lucene.store.LockObtainFailedException:
Lock obtain timed out:
NativeFSLock@/var/lib/solr/data/index/lucene-d6f7b3bf6fe64f362b4d45bfd4924f54-write.lock".
This has a backtrace, so I could dive directly into the code.  Is this
the best way to track down the problem, or are there debugging
settings that could help show why the lock is being held elsewhere?

I attempted to turn on index logging with the solrconfig.xml line

<infoStream file="...">true</infoStream>

but I can't seem to find this file in either the Tomcat or the index
directory.

I am using a Solr 3.1 development build with the patch to work with
Tika 0.7.  Thanks for any tips,

Jim


Re: general debugging techniques?

2010-06-03 Thread Jim Blomo
On Thu, Jun 3, 2010 at 11:17 AM, Nagelberg, Kallin
 wrote:
> How much memory have you given tomcat? The default is 64M which is going to 
> be really small for 5MB documents.

-Xmx128M - my understanding is that this bumps the maximum heap size
to 128MB.  What is a reasonable size?  Are there other memory flags I
should specify?
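
For reference, I'm currently setting the flag in Tomcat's startup
environment like this (file location per the Tomcat 6 docs):

  # $CATALINA_BASE/bin/setenv.sh
  export CATALINA_OPTS="-Xmx128m"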

Jim


Re: general debugging techniques?

2010-06-09 Thread Jim Blomo
On Fri, Jun 4, 2010 at 3:14 PM, Chris Hostetter
 wrote:
> : That is still really small for 5MB documents. I think the default solr
> : document cache is 512 items, so you would need at least 3 GB of memory
> : if you didn't change that and the cache filled up.
>
> that assumes that the extracted text tika extracts from each document is
> the same size as the original raw files *and* that he's configured that
> content field to be "stored" ... in practice if you only stored=true the

Most of the time the extracted text is much smaller, though
occasional zip files may expand in size (and, on an unrelated note,
multi-file zip archives currently cause Tika 0.7 to hang).

> fast, 128MB is really, really, really small for a typical Solr instance.

In any case I bumped up the heap to 3G as suggested, which has helped
stability.  I have found that in practice I need to commit every
extraction because a crash or error will wipe out all extractions
after the last commit.

> if you are only seeing one log line per request, then you are just looking
> at the "request" log ... there should be more logs with messages from all
> over the code base with various levels of severity -- and using standard
> java log level controls you can turn these up/down for various components.

Unfortunately, I'm not very familiar with Java deployments, so I
don't know where the standard controls are yet.  As a concrete
example, I do see INFO-level logs, but I haven't found a way to turn
on DEBUG-level logging in either Solr or Tomcat.  I was hoping debug
statements would point to where the extraction/indexing hangs were
occurring.  I will keep poking around; thanks for the tips.
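
In case it helps the next person: Tomcat uses java.util.logging, so I
believe the knob is a line like this in the JULI config (FINE is
roughly DEBUG in java.util.logging terms):

  # $CATALINA_BASE/conf/logging.properties
  org.apache.solr.level = FINE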

Jim