Re: general debugging techniques?
Just to confirm I'm not doing something insane, this is my general setup:

- index approx 1MM documents including HTML, pictures, office files, etc.
- files are not local to the solr process
- use upload/extract to extract text from them through tika
- use commit=1 on each POST (reasons below)
- use optimize=1 every 150 documents or so (reasons below)

Through many manual restarts and modifications to the upload script (a rough sketch of this kind of loop is included below), I've gotten about halfway (numDocs: 467372, disk usage 1.6G). The biggest problem is that any serious error can't be recovered from without restarting tomcat, and serious errors can't be distinguished at the client level from non-serious ones (e.g. tika exceptions thrown by bad documents).

On Wed, Jun 9, 2010 at 10:13 AM, Jim Blomo wrote:
> In any case I bumped up the heap to 3G as suggested, which has helped
> stability. I have found that in practice I need to commit after every
> extraction because a crash or error will wipe out all extractions
> after the last commit.

I've also found that I need to optimize very regularly because I kept getting "too many file handles" errors (though they usually came up as the more cryptic "directory, but cannot be listed: list() returned null" error).

What I am running into now is

SEVERE: Exception invoking periodic operation:
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.lang.String.substring(String.java:1940)

[full backtrace below]

After a restart and optimize this goes away for a while (~100 documents) but then comes back, and every request after the error fails. Even if I can't prevent this error, is there a way I can recover from it better? Perhaps an option to solr or tomcat to restart itself if it hits that error?

Jim

SEVERE: Exception invoking periodic operation:
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.lang.String.substring(String.java:1940)
        at java.lang.String.substring(String.java:1905)
        at java.io.File.getName(File.java:401)
        at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:229)
        at java.io.File.isDirectory(File.java:754)
        at org.apache.catalina.startup.HostConfig.checkResources(HostConfig.java:1000)
        at org.apache.catalina.startup.HostConfig.check(HostConfig.java:1214)
        at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:293)
        at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:120)
        at org.apache.catalina.core.ContainerBase.backgroundProcess(ContainerBase.java:1306)
        at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.processChildren(ContainerBase.java:1570)
        at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.processChildren(ContainerBase.java:1579)
        at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.run(ContainerBase.java:1559)
        at java.lang.Thread.run(Thread.java:619)
Jul 3, 2010 1:32:20 AM org.apache.solr.update.processor.LogUpdateProcessor finish
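For concreteness, a rough sketch of the kind of upload loop described at the top of this message. The Solr URL, the id field, and the error handling are assumptions for illustration, not details taken from the actual script:

# Rough sketch of an upload loop along the lines described above.
# Assumptions not taken from the original script: Solr at the default
# localhost:8983 single-core URL, document ids derived from file paths,
# and the stock /update/extract handler.
import os
import requests

SOLR = "http://localhost:8983/solr"

def index_files(paths):
    for n, path in enumerate(paths, start=1):
        with open(path, "rb") as f:
            resp = requests.post(
                SOLR + "/update/extract",
                params={
                    "literal.id": os.path.abspath(path),  # unique key from the file path
                    "commit": "true",                      # commit=1 on each POST
                },
                files={"file": f},
            )
        if resp.status_code != 200:
            # A bad document rejected by Tika and a fatal server-side error both
            # come back as non-200 responses, which is why they are hard to tell
            # apart from the client side.
            print("failed on %s: %d" % (path, resp.status_code))
            continue
        if n % 150 == 0:
            # optimize=1 every 150 documents or so
            requests.post(SOLR + "/update",
                          data="<optimize/>",
                          headers={"Content-Type": "text/xml"})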
Re: general debugging techniques?
On Sat, Jul 3, 2010 at 1:10 PM, Lance Norskog wrote:
> You don't need to optimize, only commit.

OK, thanks for the tip, Lance. I thought the "too many open files" problem was because I wasn't optimizing/merging frequently enough. My understanding of your suggestion is that commit also does merging, and since I am only building the index, not querying or updating it, I don't need to optimize.

> This means that the JVM spends 98% of its time doing garbage
> collection. This means there is not enough memory.

I'll increase the memory to 4G, decrease the documentCache to 5, and try again.

> I made a mistake - the bug in Lucene is not about PDFs - it happens
> with every field in every document you index in any way - so doing this
> in Tika outside Solr does not help. The only trick I can think of is
> to alternate between indexing large and small documents. This way the
> bug does not need memory for two giant documents in a row.

I've checked out and built solr from branch_3x with the tika-0.8-SNAPSHOT patch. (Earlier I was having trouble with Tika crashing too frequently.) I've confirmed that LUCENE-2387 is fixed in this branch, so hopefully I won't run into that this time.

> Also, do not query the indexer at all. If you must, don't do sorted or
> faceting requests. These eat up a lot of memory that is only freed
> with the next commit (index reload).

Good to know, though I have not been querying the index and definitely haven't ventured into faceted requests yet. The advice is much appreciated,

Jim
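Lance's suggestion above to alternate between large and small documents could be approximated by ordering the pending files by size and feeding them in from both ends of the list. A rough sketch (how the resulting list is then uploaded is assumed to be the existing script, not shown here):

# Sketch of the "alternate large and small documents" workaround:
# order the pending files by size and interleave the largest with the
# smallest, so two huge documents are never indexed back to back.
import os

def interleave_by_size(paths):
    ordered = sorted(paths, key=os.path.getsize)
    result = []
    lo, hi = 0, len(ordered) - 1
    take_large = True
    while lo <= hi:
        if take_large:
            result.append(ordered[hi])   # next largest file
            hi -= 1
        else:
            result.append(ordered[lo])   # next smallest file
            lo += 1
        take_large = not take_large
    return result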
Re: Which Solr to use?
On Tue, May 18, 2010 at 12:31 PM, Sixten Otto wrote:
> So features are being actively added to / code rearranged in
> trunk/4.0, with some of the work being back-ported to this branch to
> form a stable 3.1 release? Is that accurate?
>
> Is there any thinking about when that might drop (beyond the quite
> understandable "when it's done")? Or, perhaps more reasonably, when it
> might freeze?

I'm also interested in which branch is the recommended "testing" one (to borrow a Debian term). I'm planning a deployment in 2 months or so and have been running into too many problems with the older version of Tika bundled with 1.4 to use that release.

Jim
general debugging techniques?
I am new to debugging Java services, so I'm wondering what the best practices are for debugging solr on tomcat. I'm running into a few issues while building up my index, using the ExtractingRequestHandler to extract the data from my sources. I can read through the catalina log, but it seems to just log requests; not much info is given about errors or about when the service hangs. Here are some examples:

- Some zip or Office formats uploaded to the extract requestHandler simply hang, with the jsvc process spinning at 100% CPU. I'm unclear where in the process the request is hanging. Did it make it through Tika? Is it attempting to index? (One way to check is sketched below.) The problem is often not reproducible after restarting tomcat and starting again from the last failed document.

- Although I am keeping document size under 5MB, I regularly see "SEVERE: java.lang.OutOfMemoryError: Java heap space" errors. How can I find out which component hit this problem?

- After the above error, I often see this follow-up error on the next document: "SEVERE: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/var/lib/solr/data/index/lucene-d6f7b3bf6fe64f362b4d45bfd4924f54-write.lock". This has a backtrace, so I could dive directly into the code. Is that the best way to track down the problem, or are there debugging settings that could help show why the lock is being held elsewhere?

- I attempted to turn on low-level index logging by setting the infoStream line in solrconfig.xml to true, but I can't seem to find the output file in either the tomcat or the index directory.

I am using solr 3.1 with the patch to work with Tika 0.7.

Thanks for any tips,
Jim
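One way to narrow down where a hang happens (not something mentioned in this thread, but supported by the ExtractingRequestHandler) is to re-post a suspect file with extractOnly=true, which runs the Tika parse and returns the result without touching the index. A minimal sketch, assuming a local single-core Solr with the stock /update/extract handler:

# Sketch: check whether a suspect document makes it through Tika at all by
# asking the extract handler to parse it without indexing (extractOnly=true).
# The URL assumes a local single-core Solr; adjust as needed.
import sys
import requests

EXTRACT_URL = "http://localhost:8983/solr/update/extract"

def try_extract(path):
    with open(path, "rb") as f:
        resp = requests.post(
            EXTRACT_URL,
            params={"extractOnly": "true", "wt": "json"},
            files={"file": f},
            timeout=120,  # a Tika hang surfaces as a timeout here instead of a stuck upload
        )
    resp.raise_for_status()
    return resp.json()  # extracted text and metadata; nothing is written to the index

if __name__ == "__main__":
    print(try_extract(sys.argv[1]))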
Re: general debugging techniques?
On Thu, Jun 3, 2010 at 11:17 AM, Nagelberg, Kallin wrote:
> How much memory have you given tomcat? The default is 64M which is going to
> be really small for 5MB documents.

-Xmx128M - my understanding is that this bumps the heap size to 128M. What is a reasonable size? Are there other memory flags I should specify?

Jim
Re: general debugging techniques?
On Fri, Jun 4, 2010 at 3:14 PM, Chris Hostetter wrote:
> : That is still really small for 5MB documents. I think the default solr
> : document cache is 512 items, so you would need at least 3 GB of memory
> : if you didn't change that and the cache filled up.
>
> that assumes that the extracted text tika extracts from each document is
> the same size as the original raw files *and* that he's configured that
> content field to be "stored" ... in practice if you only stored=true the

Most of the time the extracted text is much smaller, though occasional zip files may expand in size (and, on an unrelated note, multi-file zip archives currently cause tika 0.7 to hang).

> fast, 128MB is really, really, really small for a typical Solr instance.

In any case I bumped up the heap to 3G as suggested, which has helped stability. I have found that in practice I need to commit after every extraction because a crash or error will wipe out all extractions after the last commit.

> if you are only seeing one log line per request, then you are just looking
> at the "request" log ... there should be more logs with messages from all
> over the code base with various levels of severity -- and using standard
> java log level controls you can turn these up/down for various components.

Unfortunately, I'm not very familiar with java deploys, so I don't know where the standard controls are yet. As a concrete example, I do see INFO level logs, but haven't found a way to move up to DEBUG level in either solr or tomcat. I was hopeful debug statements would point to where the extraction/indexing hangs were occurring. I will keep poking around, thanks for the tips.

Jim
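On the "standard java log level controls" mentioned above: the stock Solr war of this era logs through SLF4J's java.util.logging binding, so under Tomcat the levels can usually be raised in conf/logging.properties, where FINE is roughly the DEBUG level. A sketch, assuming that default binding rather than a custom log4j setup:

# Possible additions to Tomcat's conf/logging.properties (assumes Solr's
# default java.util.logging binding; FINE is roughly DEBUG).
org.apache.solr.level = FINE
org.apache.solr.handler.extraction.level = FINE

# Handlers have their own thresholds that cap what is actually written,
# so they need to allow FINE as well.
java.util.logging.ConsoleHandler.level = FINE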